{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![huggingface](https://huggingface.co/favicon.ico)\n",
    "\n",
    "# HuggingFace Tutorial - Sequence Classification with DistilBERT and PyTorch\n",
    "\n",
    "\n",
    "## Introduction\n",
    "\n",
    "* A week ago, I decided to learn more about NLP beacuse my previous year was mainly focused on Computer Vision applications and I couldn't put much time on NLP. I watched NVIDIA GrandMaster Series episode \"[Grandmaster Series – Building World-Class NLP Models with Transformers and Hugging Face](https://youtu.be/PXc_SlnT2g0)\" and I realized that this amazing **HuggingFace** library makes it really easy to use state-of-the-art models and get perfect results. I was also afraid of **Transformers** because I thought they are too complicated and it's not easy to understand them! But, I was totally wrong! I watched a bunch of good tutorials on Transformers and how to code them on YouTube which I'm going to introduce them bellow. I also share some of the good tutorials on HuggingFace itself which I found there:\n",
    "\n",
    "\n",
    "1. Pytorch Transformers from Scratch (Attention is all you need):\n",
    "[YouTube Link](https://youtu.be/U0s0f995w14)\n",
    "2. Grandmaster Series – Building World-Class NLP Models with Transformers and Hugging Face: [YouTube Link](https://youtu.be/PXc_SlnT2g0)\n",
    "3. Deep learning for (almost) any text classification problem (binary, multi-class, multi-label): [YouTube Link](https://youtu.be/oreIJQZ40H0)\n",
    "\n",
    "I also found HuggingFace Official Examples really helpful: [Link](https://huggingface.co/transformers/examples.html)\n",
    "\n",
    "---\n",
    "\n",
    "Although you can watch them and be good to go with your NLP/Transformer journey, I though it will be helpful to make a tutorial on using HuggingFace models based on the things I've learned so far and make it easier to start this journey for others; because, some of the details are missing in these tutorials and I'm gonna focus more on them in this one. So, stay tuned!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
    "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:5: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n",
      "  \"\"\"\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4.2.2\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import copy\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from tqdm.autonotebook import tqdm\n",
    "\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "\n",
    "\n",
    "from sklearn.model_selection import train_test_split, KFold\n",
    "\n",
    "# importing HuggingFace transformers library which is all we need to get SOTA results :)\n",
    "import transformers\n",
    "from transformers import get_linear_schedule_with_warmup\n",
    "\n",
    "print(transformers.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building A Custom PyTorch Dataset\n",
    "\n",
    "* One important thing that I was looking for was how to build an efficient PyTorch Dataset from my own data (actually, Kaggle data in this case!). Because, in the [HuggingFace official examples](https://huggingface.co/transformers/examples.html) they were using their own datasets library with ready-to-use datasets but most of the time, we need to build our own datasets with our own data. \n",
    "\n",
    "* So, I searched and found this amazing short tutorial from HuggingFace: [Fine-tuning with custom datasets](https://huggingface.co/transformers/custom_datasets.html). The following code uses the idea from this tutorial on building a custom dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "class TweetDataset(torch.utils.data.Dataset):\n",
    "    def __init__(self, dataframe, tokenizer, mode=\"train\", max_length=None):\n",
    "        self.dataframe = dataframe\n",
    "        if mode != \"test\":\n",
    "            self.targets = dataframe['target'].values\n",
    "        texts = list(dataframe['text'].values)\n",
    "        self.encodings = tokenizer(texts, \n",
    "                                   padding=True, \n",
    "                                   truncation=True, \n",
    "                                   max_length=max_length)\n",
    "        self.mode = mode\n",
    "        \n",
    "        \n",
    "    def __getitem__(self, idx):\n",
    "        # putting each tensor in front of the corresponding key from the tokenizer\n",
    "        # HuggingFace tokenizers give you whatever you need to feed to the corresponding model\n",
    "        item = {key: torch.tensor(values[idx]) for key, values in self.encodings.items()}\n",
    "        # when testing, there are no targets so we won't do the following\n",
    "        if self.mode != \"test\":\n",
    "            item['labels'] = torch.tensor(self.targets[idx])\n",
    "        return item\n",
    "    \n",
    "    def __len__(self):\n",
    "        return len(self.dataframe)"
   ]
  },
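  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before moving on, here is a quick, optional sketch of what the tokenizer actually returns: the `input_ids` and `attention_mask` that our Dataset wraps into tensors. The sample sentences below are made up just for illustration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# a minimal sketch (not part of the training pipeline); the texts are made-up examples\n",
    "demo_tokenizer = transformers.AutoTokenizer.from_pretrained('distilbert-base-uncased', use_fast=True)\n",
    "demo_encodings = demo_tokenizer([\"Our house is on fire!\", \"What a beautiful day\"],\n",
    "                                padding=True, truncation=True, max_length=140)\n",
    "# 'input_ids' are the token ids; 'attention_mask' marks real tokens (1) vs. padding (0)\n",
    "for key, values in demo_encodings.items():\n",
    "    print(key, values)"
   ]
  },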
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just a wrapper to easier build the Dataset and DataLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_loaders(dataframe, tokenizer, mode=\"train\", max_length=None):\n",
    "    dataset = TweetDataset(dataframe, tokenizer, mode, max_length=max_length)\n",
    "    dataloader = torch.utils.data.DataLoader(dataset, \n",
    "                                             batch_size=options.batch_size, \n",
    "                                             shuffle=True if mode == \"train\" else False,\n",
    "                                             num_workers=options.num_workers)\n",
    "    return dataloader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Custom Classification Model based on DistilBERT\n",
    "\n",
    "* This part needs some explanation. As the title said in the beginning of this tutorial, we are going to use DistilBERT model. But as you might have guessed, DistilBERT is a Language Model which needs to be fine-tuned on a final task of interestl; here being Classification. For those of you that are familiar with Computer Vision, it's like using a fancy ResNet model pre-trained on ImageNet and then building a custom head for our specific task!\n",
    "\n",
    "* So, we need to build that custom head here. Before doing so, we need to know something about BERT family models (I recommend to study [original BERT paper](https://arxiv.org/abs/1810.04805)). In the paper, they introduce some special tokens named [CLS] and [SEP] which they add to the sequence which is being fed to the model. [CLS] is used at the beginning of the sequence and [SEP] tokens are used to notify the end of each part in a sequence (a sequence which is going to be fed to BERT model can be made up of two parts; e.x question and corresponding text). \n",
    " \n",
    "* In the paper they explain that they use [CLS] hidden state representation to do classification tasks for the sequence. So, in our case, we are going to the same. DistilBERT model will produce a vector of size 768 as a hidden representation for this [CLS] token and we will give it to some nn.Linear layers to do our own specific task. "
   ]
  },
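  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# a small sketch with a made-up sentence, showing the special tokens the tokenizer adds;\n",
    "# the exact subword splits depend on the vocabulary\n",
    "demo_tokenizer = transformers.AutoTokenizer.from_pretrained('distilbert-base-uncased', use_fast=True)\n",
    "demo_ids = demo_tokenizer(\"Transformers are not that scary!\")['input_ids']\n",
    "# expect something like: ['[CLS]', 'transformers', 'are', ..., '!', '[SEP]']\n",
    "print(demo_tokenizer.convert_ids_to_tokens(demo_ids))"
   ]
  },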
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "class CustomModel(nn.Module):\n",
    "    def __init__(self,\n",
    "                 bert_model,\n",
    "                 num_labels, \n",
    "                 bert_hidden_dim=768, \n",
    "                 classifier_hidden_dim=768, \n",
    "                 dropout=None):\n",
    "        \n",
    "        super().__init__()\n",
    "        self.bert_model = bert_model\n",
    "        # nn.Identity does nothing if the dropout is set to None\n",
    "        self.head = nn.Sequential(nn.Linear(bert_hidden_dim, classifier_hidden_dim),\n",
    "                                  nn.ReLU(),\n",
    "                                  nn.Dropout(dropout) if dropout is not None else nn.Identity(),\n",
    "                                  nn.Linear(classifier_hidden_dim, num_labels))\n",
    "    \n",
    "    def forward(self, batch):\n",
    "        # feeding the input_ids and masks to the model. These are provided by our tokenizer\n",
    "        output = self.bert_model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])\n",
    "        # obtaining the last layer hidden states of the Transformer\n",
    "        last_hidden_state = output.last_hidden_state # shape: (batch_size, seq_length, bert_hidden_dim)\n",
    "        # As I said, the CLS token is in the beginning of the sequence. So, we grab its representation \n",
    "        # by indexing the tensor containing the hidden representations\n",
    "        CLS_token_state = last_hidden_state[:, 0, :]\n",
    "        # passing this representation through our custom head\n",
    "        logits = self.head(CLS_token_state)\n",
    "        return logits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Training and Evaluation functions\n",
    "\n",
    "* There is nothing NLP/Transformer specific here! Just some functions to the training and eval loops and print stuff while the model is being trained\n",
    "\n",
    "* Pay attention to the comments in the codes below; I've explained the parts that could be confusing or new to you!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "class AvgMeter:\n",
    "    def __init__(self, name=\"Metric\"):\n",
    "        self.name = name\n",
    "        self.reset()\n",
    "    \n",
    "    def reset(self):\n",
    "        self.avg, self.sum, self.count = [0]*3\n",
    "    \n",
    "    def update(self, val, count=1):\n",
    "        self.count += count\n",
    "        self.sum += val * count\n",
    "        self.avg = self.sum / self.count\n",
    "    \n",
    "    def __repr__(self):\n",
    "        text = f\"{self.name}: {self.avg:.4f}\"\n",
    "        return text\n",
    "\n",
    "def one_epoch(model, criterion, loader, device, optimizer=None, lr_scheduler=None, mode=\"train\", step=\"batch\"):\n",
    "    loss_meter = AvgMeter()\n",
    "    acc_meter = AvgMeter()\n",
    "    \n",
    "    tqdm_object = tqdm(loader, total=len(loader))\n",
    "    for batch in tqdm_object:\n",
    "        batch = {k: v.to(device) for k, v in batch.items()}\n",
    "        preds = model(batch)\n",
    "        loss = criterion(preds, batch['labels'])\n",
    "        if mode == \"train\":\n",
    "            optimizer.zero_grad()\n",
    "            loss.backward()\n",
    "            optimizer.step()\n",
    "            if step == \"batch\":\n",
    "                lr_scheduler.step()\n",
    "                \n",
    "        count = batch['input_ids'].size(0)\n",
    "        loss_meter.update(loss.item(), count)\n",
    "        \n",
    "        accuracy = get_accuracy(preds.detach(), batch['labels'])\n",
    "        acc_meter.update(accuracy.item(), count)\n",
    "        if mode == \"train\":\n",
    "            tqdm_object.set_postfix(loss=loss_meter.avg, accuracy=acc_meter.avg, lr=get_lr(optimizer))\n",
    "        else:\n",
    "            tqdm_object.set_postfix(loss=loss_meter.avg, accuracy=acc_meter.avg)\n",
    "    \n",
    "    return loss_meter, acc_meter\n",
    "\n",
    "def get_lr(optimizer):\n",
    "    for param_group in optimizer.param_groups:\n",
    "        return param_group[\"lr\"]\n",
    "\n",
    "def get_accuracy(preds, targets):\n",
    "    \"\"\"\n",
    "    preds shape: (batch_size, num_labels)\n",
    "    targets shape: (batch_size)\n",
    "    \"\"\"\n",
    "    preds = preds.argmax(dim=1)\n",
    "    acc = (preds == targets).float().mean()\n",
    "    return acc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_eval(epochs, model, train_loader, valid_loader, \n",
    "               criterion, optimizer, device, options, lr_scheduler=None):\n",
    "    \n",
    "    best_loss = float('inf')\n",
    "    best_model_weights = copy.deepcopy(model.state_dict())\n",
    "    \n",
    "    for epoch in range(epochs):\n",
    "        print(\"*\" * 30)\n",
    "        print(f\"Epoch {epoch + 1}\")\n",
    "        current_lr = get_lr(optimizer)\n",
    "        \n",
    "        model.train()\n",
    "        train_loss, train_acc = one_epoch(model, \n",
    "                                          criterion, \n",
    "                                          train_loader, \n",
    "                                          device,\n",
    "                                          optimizer=optimizer,\n",
    "                                          lr_scheduler=lr_scheduler,\n",
    "                                          mode=\"train\",\n",
    "                                          step=options.step)                     \n",
    "        model.eval()\n",
    "        with torch.no_grad():\n",
    "            valid_loss, valid_acc = one_epoch(model, \n",
    "                                              criterion, \n",
    "                                              valid_loader, \n",
    "                                              device,\n",
    "                                              optimizer=None,\n",
    "                                              lr_scheduler=None,\n",
    "                                              mode=\"valid\")\n",
    "        \n",
    "        if valid_loss.avg < best_loss:\n",
    "            best_loss = valid_loss.avg\n",
    "            best_model_weights = copy.deepcopy(model.state_dict())\n",
    "            torch.save(model.state_dict(), f'{options.model_path}/{options.model_save_name}')\n",
    "            print(\"Saved best model!\")\n",
    "        \n",
    "        # or you could do: if step == \"epoch\":\n",
    "        if isinstance(lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):\n",
    "            lr_scheduler.step(valid_loss.avg)\n",
    "            # if the learning rate changes by ReduceLROnPlateau, we are going to\n",
    "            # reload our previous best model weights and start from there with a lower LR\n",
    "            if current_lr != get_lr(optimizer):\n",
    "                print(\"Loading best model weights!\")\n",
    "                model.load_state_dict(torch.load(f'{options.model_path}/{options.model_save_name}', \n",
    "                                                 map_location=device))\n",
    "        \n",
    "\n",
    "        print(f\"Train Loss: {train_loss.avg:.5f}\")\n",
    "        print(f\"Train Accuracy: {train_acc.avg:.5f}\")\n",
    "        \n",
    "        print(f\"Valid Loss: {valid_loss.avg:.5f}\")\n",
    "        print(f\"Valid Accuracy: {valid_acc.avg:.5f}\")\n",
    "        print(\"*\" * 30)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Options:\n",
    "    model_name = 'distilbert-base-uncased'\n",
    "    batch_size = 64\n",
    "    num_labels = 2\n",
    "    epochs = 10\n",
    "    num_workers = 2\n",
    "    learning_rate = 3e-5\n",
    "    scheduler = \"ReduceLROnPlateau\"\n",
    "    patience = 2\n",
    "    dropout = 0.5\n",
    "    model_path = \".\"\n",
    "    max_length = 140\n",
    "    model_save_name = \"model.pt\"\n",
    "    n_folds = 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Taking care of Cross Validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_folds(dataframe, n_splits=5):\n",
    "    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)\n",
    "    for i, (_, valid_idx) in enumerate(kf.split(X=dataframe['id'])):\n",
    "        dataframe.loc[valid_idx, 'fold'] = i\n",
    "    return dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def one_fold(fold, options):  \n",
    "    print(f\"Training Fold: {fold}\")\n",
    "    \n",
    "    # Here, we load the pre-trained DistilBERT model from transformers library\n",
    "    bert_model = transformers.DistilBertModel.from_pretrained(options.model_name)\n",
    "    # Loading the corresponding tokenizer from HuggingFace by using AutoTokenizer class.\n",
    "    tokenizer = transformers.AutoTokenizer.from_pretrained(options.model_name, use_fast=True)\n",
    "    \n",
    "    dataframe = pd.read_csv(\"./input/train.csv\")\n",
    "    dataframe = make_folds(dataframe, n_splits=options.n_folds)\n",
    "    train_dataframe = dataframe[dataframe['fold'] != fold].reset_index(drop=True)\n",
    "    valid_dataframe = dataframe[dataframe['fold'] == fold].reset_index(drop=True)\n",
    "\n",
    "    train_loader = make_loaders(train_dataframe, tokenizer, \"train\", options.max_length)\n",
    "    valid_loader = make_loaders(valid_dataframe, tokenizer, \"valid\", options.max_length)\n",
    "\n",
    "    device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "    model = CustomModel(bert_model, options.num_labels, dropout=options.dropout).to(device)\n",
    "    optimizer = torch.optim.Adam(model.parameters(), lr=options.learning_rate)\n",
    "    if options.scheduler == \"ReduceLROnPlateau\":\n",
    "        lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, \n",
    "                                                                  mode=\"min\", \n",
    "                                                                  factor=0.5, \n",
    "                                                                  patience=options.patience)\n",
    "        \n",
    "        # when to step the scheduler: after an epoch or after a batch\n",
    "        options.step = \"epoch\"\n",
    "        \n",
    "    elif options.scheduler == \"LinearWarmup\":\n",
    "        num_train_steps = len(train_loader) * options.epochs\n",
    "        lr_scheduler = get_linear_schedule_with_warmup(optimizer, \n",
    "                                                       num_warmup_steps=0, \n",
    "                                                       num_training_steps=num_train_steps)\n",
    "        \n",
    "        # when to step the scheduler: after an epoch or after a batch\n",
    "        options.step = \"batch\"\n",
    "    \n",
    "    criterion = nn.CrossEntropyLoss()\n",
    "    options.model_save_name = f\"model_fold_{fold}.pt\"\n",
    "    train_eval(options.epochs, model, train_loader, valid_loader,\n",
    "               criterion, optimizer, device, options, lr_scheduler=lr_scheduler)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_folds(options):\n",
    "    n_folds = options.n_folds\n",
    "    for i in range(n_folds):\n",
    "        one_fold(fold=i, options=options)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "options = Options()\n",
    "train_folds(options)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_one_model(options):  \n",
    "    test_dataframe = pd.read_csv(\"./input/test.csv\")\n",
    "\n",
    "    bert_model = transformers.DistilBertModel.from_pretrained(options.model_name)\n",
    "    tokenizer = transformers.AutoTokenizer.from_pretrained(options.model_name, use_fast=True)\n",
    "    \n",
    "    test_loader = make_loaders(test_dataframe, tokenizer, mode=\"test\")\n",
    "    device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "    model = CustomModel(bert_model, options.num_labels, dropout=options.dropout).to(device)\n",
    "    model.load_state_dict(torch.load(f\"{options.model_path}/{options.model_save_name}\", \n",
    "                                     map_location=device))\n",
    "    model.eval()\n",
    "    \n",
    "    all_preds = None\n",
    "    with torch.no_grad():\n",
    "        for batch in tqdm(test_loader):\n",
    "            batch = {k: v.to(device) for k, v in batch.items()}\n",
    "            preds = model(batch)\n",
    "            if all_preds is None:\n",
    "                all_preds = preds\n",
    "            else:\n",
    "                all_preds = torch.cat([all_preds, preds], dim=0)\n",
    "    \n",
    "    return all_preds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_all_models(options):\n",
    "    n_folds = options.n_folds\n",
    "    all_model_preds = []\n",
    "    for fold in range(n_folds):\n",
    "        options.model_save_name = f\"model_fold_{fold}.pt\"\n",
    "        all_preds = test_one_model(options)\n",
    "        all_model_preds.append(all_preds)\n",
    "    \n",
    "    all_model_preds = torch.stack(all_model_preds, dim=0)\n",
    "    print(all_model_preds.shape)\n",
    "    # I will return the mean of the final predictions of all the models\n",
    "    # You could do other things like 'voting' between the five models\n",
    "    return all_model_preds.mean(0)"
   ]
  },
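  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As noted in the comment above, averaging the logits is just one way to ensemble the folds. Here is a hedged sketch of the 'voting' alternative, assuming you have the stacked tensor of shape (n_folds, n_samples, num_labels) that gets printed above (you would need test_all_models to return that stacked tensor instead of its mean):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def vote_predictions(all_model_preds):\n",
    "    # all_model_preds shape: (n_folds, n_samples, num_labels)\n",
    "    # each fold's model votes with its own argmax class\n",
    "    votes = all_model_preds.argmax(dim=2)  # shape: (n_folds, n_samples)\n",
    "    # the most frequent class per sample wins; vote on CPU since\n",
    "    # torch.mode support on GPU has varied across PyTorch versions\n",
    "    return votes.cpu().mode(dim=0).values"
   ]
  },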
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_preds = test_all_models(options)\n",
    "predictions = all_preds.argmax(dim=1).cpu().numpy()\n",
    "sample_submission = pd.read_csv(\"./input/sample_submission.csv\")\n",
    "sample_submission['target'] = predictions\n",
    "sample_submission.to_csv(\"sample_submission.csv\", index=False)\n",
    "pd.read_csv(\"sample_submission.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Thanks for reading my tutorial. I'll be really happy to know what you think about it and if learned something new! Happy Learning!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
